Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition #17332
base: master
Conversation
Add testcase or it didn't happen. :)

Look, children, that's what an evil maintainer looks like :P Will never let you off the hook with any PR, ever!
update_cuda_graph_executable: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
update_cuda_graph_executable: CUDA graph update failed
CONT(type=f32,ne=[10,10,10,1]): OK
CONT(type=f32,ne=[2,1,1,1]): OK
CONT(type=f32,ne=[2,1,3,5]): OK
CONT(type=f32,ne=[2,3,5,7]): OK
CONT(type=f16,ne=[2,1,1,1]): OK
CONT(type=f16,ne=[2,1,3,5]): OK
CONT(type=f16,ne=[2,3,5,7]): OK
CONT(type=bf16,ne=[2,1,1,1]): OK
CONT(type=bf16,ne=[2,1,3,5]): OK
CONT(type=bf16,ne=[2,3,5,7]): OK
CONT(type=f32,ne=[1,4,2,1]): [CONT] NMSE = 0.447623183 > 0.000000100 FAIL
CONT(type=f32,ne=[1,8,17,1]): [CONT] NMSE = 2.241813873 > 0.000000100 FAIL
CONT(type=bf16,ne=[1,4,2,1]): [CONT] NMSE = 0.058848433 > 0.000000100 FAIL
CONT(type=bf16,ne=[1,8,17,1]): [CONT] NMSE = 1.181509486 > 0.000000100 FAIL
10/14 tests passed

vs

update_cuda_graph_executable: CUDA graph update failed
ggml_backend_cuda_graph_compute: disabling CUDA graphs due to too many consecutive updates
update_cuda_graph_executable: CUDA graph update failed
CONT(type=f32,ne=[10,10,10,1]): OK
CONT(type=f32,ne=[2,1,1,1]): OK
CONT(type=f32,ne=[2,1,3,5]): OK
CONT(type=f32,ne=[2,3,5,7]): OK
CONT(type=f16,ne=[2,1,1,1]): OK
CONT(type=f16,ne=[2,1,3,5]): OK
CONT(type=f16,ne=[2,3,5,7]): OK
CONT(type=bf16,ne=[2,1,1,1]): OK
CONT(type=bf16,ne=[2,1,3,5]): OK
CONT(type=bf16,ne=[2,3,5,7]): OK
CONT(type=f32,ne=[1,4,2,1]): OK
CONT(type=f32,ne=[1,8,17,1]): OK
CONT(type=bf16,ne=[1,4,2,1]): OK
CONT(type=bf16,ne=[1,8,17,1]): OK
14/14 tests passed
Backend CUDA0: OK

@CISC There :)
JohannesGaessler left a comment
Thank you for the fix and sorry for not catching the bug during review; some functionality that would have covered this was removed and the logic was not adjusted.
Preferably, add an argument to the existing tests for GGML_OP_CONT rather than adding a new test case.
@JohannesGaessler okay, integrated it into the CONT test, added a flag
JohannesGaessler left a comment
Do you have the necessary permissions to hit the merge button or do I need to do it?
Nope, don't have write access. But I'll address the Evil Maintainer's concerns first :D

Alright, can merge. But you know, @CISC, now that you've given me this idea I'm going to run around and refactor all the tests to use loops? :P
@pwilkin please do a more general refactor of tests in a dedicated PR.

Of course, it was meant as a generalized …
(and BTW, don't worry, I'm not opening a new refactoring till I'm done with the …)

Great, I'll keep you busy with changes there then. 👿

The very first failed CI job I checked failed because the WebGPU backend crashes on the newly added tests. Please remove the tests from this PR; then we can merge only the fix for CUDA and make a new PR with the tests, which we can merge once all backends work correctly.

It doesn't actually fail on the test, BTW; it fails because with F16 the view offset of …
@CISC @JohannesGaessler So wait, should I remove the F16 testcase, or is it actually something that should be fixed on the WebGPU backend?

You can change it to a tensor where the view offset is a byte-multiple of 4 for now.
Okay, I've taken out the F16/BF16 tests with … BTW, I think the problem might be more subtle, because the tensor {1, 4, 2, 1} is a multiple of 4. What happens is that the calculated view slice offset is possibly not a multiple of 4.

Yeah, I edited my misleading comment. :)
@JohannesGaessler WebGPU succeeded this time so I think it's safe to merge.

Yep, looks like this is a bug in the WebGPU backend. WebGPU has more stringent alignment requirements for portability. This code considers the cases where the offset + size is not a multiple of 4, but not when the starting offset is not a multiple of 4! Should be a relatively easy fix. I might prefer to fix it in the same PR that adds back the test cases that currently fail, so I am happy to work with whoever creates that PR on the fix.
@pwilkin my opinion is that we should merge the fix for CUDA earlier rather than later. That is the important part; the main purpose of the tests is to avoid having the same breakage again later on. But if the addition of a test case breaks the tests themselves, we should put those into a different PR in the meantime. And we should definitely keep the problematic case for WebGPU rather than work around it.
@JohannesGaessler yeah, I kept the regression tests, I just added a skip (and TODO) for the WebGPU cases.
@reeselevine Once this gets merged you can just remove the continue block with the TODO comment for the failing tests.
See #16095 (comment) for the case.